Interactive

  • applymap/apply/map

  • value_counts

  • list comprehension hconcat

import pandas as pd
import altair as alt
import numpy as np

Example with data from Spotify

Here we use the spotify_dataset.csv file from Canvas. The dataset originally came from Kaggle, and the Kaggle page includes a description of the columns.

We perform some “cleaning” of the dataset. By the end of Math 10, everything in the following cell should be understandable, but for now you shouldn’t worry about the details of this “cleaning”.

Important: You may need to change the path from data/spotify_dataset.csv, depending on where you have this csv file stored.

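df = pd.read_csv("data/spotify_dataset.csv") # change path if necessary
# Treat cells that contain only a single space as missing values
df = df.replace(" ",np.nan)
# Remove the thousands separators from Streams (the entries are still strings at this point)
df["Streams"] = df["Streams"].str.replace(",","")
# Convert the columns at positions 5 and 7, and at positions 12 through 21, to numeric values
df.iloc[:,[5,7]] = df.iloc[:,[5,7]].apply(pd.to_numeric,axis=0).copy()
df.iloc[:,12:22] = df.iloc[:,12:22].apply(pd.to_numeric,axis=0).copy()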
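
To see what the cleaning steps do without needing the csv file, here is a minimal sketch on a tiny DataFrame. The DataFrame toy and its values are invented for illustration and are not taken from the Spotify dataset. The sketch assigns the converted columns by name rather than with iloc, which is a simplification of the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the Spotify data (values invented for illustration).
toy = pd.DataFrame({
    "Streams": ["48,633,449", "47,248,719", "40,162,559"],
    "Energy": ["0.8", " ", "0.664"],
})

# Step 1: a cell containing only a space is treated as missing.
toy = toy.replace(" ", np.nan)

# Step 2: remove the thousands separators, then convert the strings to numbers.
toy["Streams"] = pd.to_numeric(toy["Streams"].str.replace(",", ""))
toy["Energy"] = pd.to_numeric(toy["Energy"])

print(toy.dtypes)
print(toy)
```

After these steps both columns hold numbers, and the entry that was a single space shows up as a missing value.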

Scatter plot

The following Altair chart is just like what we made above with our random DataFrame. We again use the column names to specify which parts of the data to use. Before we used column names like “a” and “b”. Here the column names are more descriptive, like “Energy” and “Loudness”.

df = df[df["Chord"].notna()].copy()
chords = sorted(list(set(df["Chord"])))
chords
['A',
 'A#/Bb',
 'B',
 'C',
 'C#/Db',
 'D',
 'D#/Eb',
 'E',
 'F',
 'F#/Gb',
 'G',
 'G#/Ab']
df["Chord"].value_counts().max()
214
df["Natural"] = df["Chord"].map(lambda x: 1 if len(x) == 1 else 0)
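
The line above uses map to apply a function to every entry of a single column. Here is a small sketch contrasting map with apply, using an invented DataFrame toy whose column names mirror the Spotify data. In recent pandas releases the element-wise DataFrame method applymap has been renamed to DataFrame.map, so this sketch avoids calling either one directly and instead combines apply with Series.map.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the Spotify DataFrame.
toy = pd.DataFrame({
    "Chord": ["A", "C#/Db", "G", "F#/Gb"],
    "Energy": [0.701, 0.648, 0.825, 0.741],
    "Valence": [0.742, 0.440, 0.915, 0.646],
})

# Series.map: apply a function to each entry of one column.
natural = toy["Chord"].map(lambda x: 1 if len(x) == 1 else 0)

# DataFrame.apply with axis=0: apply a function to each column as a whole.
col_max = toy[["Energy", "Valence"]].apply(lambda col: col.max(), axis=0)

# Element-wise on several columns: apply Series.map to each column in turn.
rounded = toy[["Energy", "Valence"]].apply(lambda col: col.map(lambda v: round(v, 1)))

print(natural.tolist())
print(col_max)
print(rounded)
```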
brush = alt.selection_interval(empty='none')

chart1 = alt.Chart(df).mark_circle().encode(
    x = "Energy",
    y = "Valence",
    color = 'Chord',
    tooltip = ["Artist","Song Name","Release Date","Chord"]
).add_selection(
    brush,
)

chart2 = alt.Chart(df).mark_bar().encode(
    x = alt.X("Chord",scale=alt.Scale(domain=chords)),
    y = alt.Y("count()",scale=alt.Scale(domain=[0,220])),
    color="Chord",
).transform_filter(
    brush,
)

chart1 | chart2
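
Conceptually, the bar chart on the right shows the Chord counts for only those points that lie inside the brushed rectangle. A rough pandas sketch of that filtering is below. The DataFrame toy and the brush bounds x_lo, x_hi, y_lo, y_hi are hypothetical; in the real chart the bounds come from wherever the user drags.

```python
import pandas as pd

# Hypothetical toy rows; column names mirror the Spotify DataFrame.
toy = pd.DataFrame({
    "Energy":  [0.20, 0.55, 0.60, 0.90, 0.58],
    "Valence": [0.30, 0.50, 0.65, 0.80, 0.10],
    "Chord":   ["A", "C", "C", "G", "A"],
})

# Hypothetical brush rectangle: the x and y ranges a user might drag out.
x_lo, x_hi = 0.5, 0.7
y_lo, y_hi = 0.4, 0.7

# A point is selected when both coordinates fall inside the rectangle.
inside = toy["Energy"].between(x_lo, x_hi) & toy["Valence"].between(y_lo, y_hi)

# The filtered bar chart displays these counts.
counts = toy.loc[inside, "Chord"].value_counts()
print(counts)
```

This is only an analogy for what transform_filter does with the selection; Altair performs the filtering in the browser, not in pandas.
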
brush = alt.selection_single(empty='none',fields=["Chord"],on='mouseover')

chart1 = alt.Chart(df).mark_circle().encode(
    x = alt.X("Energy",scale=alt.Scale(domain=[0,1])),
    y = alt.Y("Valence",scale=alt.Scale(domain=[0,1])),
    color = 'Chord',
    tooltip = ["Artist","Song Name","Release Date","Chord"]
).transform_filter(
    brush
)

chart2 = alt.Chart(df).mark_bar().encode(
    x = alt.X("Chord",scale=alt.Scale(domain=chords)),
    y = alt.Y("count()",scale=alt.Scale(domain=[0,220])),
    color="Chord",
).add_selection(
    brush,
)

chart1 | chart2
brush = alt.selection_multi(empty='none',fields=["Chord"],on='click')

chart1 = alt.Chart(df).mark_circle().encode(
    x = alt.X("Energy",scale=alt.Scale(domain=[0,1])),
    y = alt.Y("Valence",scale=alt.Scale(domain=[0,1])),
    color = 'Chord',
    tooltip = ["Artist","Song Name","Release Date","Chord"]
).transform_filter(
    brush
)

chart2 = alt.Chart(df).mark_bar().encode(
    x = alt.X("Chord",scale=alt.Scale(domain=chords)),
    y = alt.Y("count()",scale=alt.Scale(domain=[0,220])),
    color="Chord",
).add_selection(
    brush,
)

chart1 | chart2

One of my favorite customizations in Altair is to use a more interesting color scheme. Here is an example using the color scheme “goldred”. You can find more color options in the Vega documentation.

alt.Chart(df).mark_circle().encode(
    x = "Energy",
    y = "Loudness",
    color = alt.Color('Acousticness',scale=alt.Scale(scheme="goldred")),
    tooltip = ["Artist","Song Name","Release Date","Chord"]
)

Sometimes the colors look more natural if they are reversed. We do that by adding reverse=True in the alt.Scale component.

alt.Chart(df).mark_circle().encode(
    x = "Energy",
    y = "Loudness",
    color = alt.Color('Acousticness',scale=alt.Scale(scheme="goldred",reverse=True)),
    tooltip = ["Artist","Song Name","Release Date","Chord"]
)

Spotify chart with tooltip

In the following chart we use a different color scheme, we specify the dimensions of the chart to make it a little bigger, and we give the chart a title.

alt.Chart(df).mark_circle().encode(
    x = "Energy",
    y = "Loudness",
    color = alt.Color('Acousticness', scale=alt.Scale(scheme='turbo',reverse=True)),
    tooltip = ["Artist","Song Name","Release Date","Chord"]
).properties(
    width = 720,
    height = 450,
    title="Spotify dataset from Kaggle"
)

Caution

The rest of this notebook can be skipped on a first reading. We give some more advanced examples.

Histogram

Here is an example of how to make a histogram using Altair. The height of each bar indicates how many rows of the dataset fall in that bar's category. The count() entry is not the name of a column. Instead it is a special Altair function that counts how many rows belong to each category.

alt.Chart(df).mark_bar().encode(
    x = "Artist",
    y = "count()"
)

There are so many artists that this chart is pretty difficult to interpret. Let’s restrict ourselves to the top artists.

Here are the top 19 artists. (Why 19 rather than 20? No great reason, but this particular chart looks better with 19.)

top_artists = df.Artist.value_counts()[:19]
top_artists
Taylor Swift          52
Justin Bieber         32
Lil Uzi Vert          32
Juice WRLD            30
Pop Smoke             29
BTS                   29
Bad Bunny             28
Eminem                22
The Weeknd            21
Drake                 19
Ariana Grande         18
Billie Eilish         18
Selena Gomez          17
J. Cole               16
Doja Cat              16
Dua Lipa              15
Lady Gaga             14
Tyler, The Creator    14
DaBaby                14
Name: Artist, dtype: int64
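
The slice [:19] works because value_counts returns its counts sorted from largest to smallest, so the first entries are the most frequent ones. Here is a minimal sketch with an invented Series of placeholder artist names.

```python
import pandas as pd

# Hypothetical toy column of artist names.
artists = pd.Series(["A", "B", "A", "C", "A", "B", "D"], name="Artist")

# value_counts sorts from the most frequent value to the least frequent.
counts = artists.value_counts()

# Slicing keeps the first entries, that is, the most frequent values.
top2 = counts[:2]
print(top2)
```

The index of the result holds the artist names and the values hold the counts, which is why top_artists.index is the natural thing to pass along when we only need the names.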

Let’s make our Altair chart using the sub-DataFrame with just these 19 top artists. We make this using a new pandas method, isin.

df_top = df[df.Artist.isin(top_artists.index)]
df_top.head()
 | Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence | Chord | Natural
6 | 7 | 3 | 16 | 2021-05-14--2021-05-21 | Kiss Me More (feat. SZA) | 29356736 | Doja Cat | 8640063.0 | 748mdHapucXQri7IAO8yFK | ['dance pop', 'pop'] | ... | 0.701 | -3.541 | 0.0286 | 0.23500 | 0.1230 | 110.968 | 208867.0 | 0.742 | G#/Ab | 0
8 | 9 | 3 | 8 | 2021-06-18--2021-06-25 | Yonaguni | 25030128 | Bad Bunny | 36142273.0 | 2JPLbjOn0wPCngEot2STUS | ['latin', 'reggaeton', 'trap latino'] | ... | 0.648 | -4.601 | 0.1180 | 0.27600 | 0.1350 | 179.951 | 206710.0 | 0.440 | C#/Db | 0
10 | 11 | 4 | 43 | 2021-05-07--2021-05-14 | Levitating (feat. DaBaby) | 23518010 | Dua Lipa | 27142474.0 | 463CkQjx2Zk1yXoBuierM9 | ['dance pop', 'pop', 'uk pop'] | ... | 0.825 | -3.787 | 0.0601 | 0.00883 | 0.0674 | 102.977 | 203064.0 | 0.915 | F#/Gb | 0
12 | 13 | 5 | 3 | 2021-07-09--2021-07-16 | Permission to Dance | 22062812 | BTS | 37106176.0 | 0LThjFY2iTtNdd4wviwVV2 | ['k-pop', 'k-pop boy group'] | ... | 0.741 | -5.330 | 0.0427 | 0.00544 | 0.3370 | 124.925 | 187585.0 | 0.646 | A | 1
13 | 14 | 1 | 19 | 2021-04-02--2021-04-09 | Peaches (feat. Daniel Caesar & Giveon) | 20294457 | Justin Bieber | 48504126.0 | 4iJyoBOLtHqaGxP12qzhQI | ['canadian pop', 'pop', 'post-teen pop'] | ... | 0.696 | -6.181 | 0.1190 | 0.32100 | 0.4200 | 90.030 | 198082.0 | 0.464 | C | 1

5 rows × 24 columns
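
The isin method returns a Boolean Series that is True exactly where the entry appears in the given collection, and that Boolean Series can then be used to select rows. Here is a minimal sketch on an invented DataFrame toy with placeholder artist names and made-up Streams values.

```python
import pandas as pd

# Hypothetical toy data with placeholder artist names.
toy = pd.DataFrame({
    "Artist": ["A", "B", "A", "C", "A", "B", "D"],
    "Streams": [10, 20, 30, 40, 50, 60, 70],
})

# The two most frequent artists; the index holds their names.
top = toy["Artist"].value_counts()[:2]

# True for rows whose Artist is one of the top names, False otherwise.
mask = toy["Artist"].isin(top.index)

# Boolean indexing keeps only the rows where mask is True.
toy_top = toy[mask]
print(toy_top)
```
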

alt.Chart(df_top).mark_bar().encode(
    x = "Artist",
    y = "count()"
)

Let’s add color to the chart, using the average number of Streams for each artist. In this example, mean is a special function in Altair, just like count.

Spotify bar chart

alt.Chart(df_top).mark_bar().encode(
    x = "Artist",
    y = "count()",
    color = "mean(Streams)"
)
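
In pandas terms, the bar heights and the colors in this chart correspond roughly to a groupby computation: one count and one mean per artist. The sketch below uses an invented DataFrame toy with made-up Streams values, and it is meant as an analogy for what Altair computes, not as a description of how Altair computes it.

```python
import pandas as pd

# Hypothetical toy data with placeholder artist names.
toy = pd.DataFrame({
    "Artist": ["A", "A", "A", "B", "B"],
    "Streams": [10, 20, 30, 100, 200],
})

# One row per artist: "count" plays the role of the bar height,
# and "mean" plays the role of the bar color.
summary = toy.groupby("Artist")["Streams"].agg(["count", "mean"])
print(summary)
```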

Exercise

Copy the above histogram code, and replace mean with sum. Suddenly the colors are less interesting. Why do you think that is?

Interactive example

We end with an example just for inspiration. One of the distinguishing features of Altair is its support for interactivity. If you click and drag on the chart below, the points in the region you select will gain color.

brush = alt.selection_interval(empty='none')

chart = alt.Chart(df).mark_circle().encode(
    x = "Energy",
    y = "Loudness",
    color = alt.condition(brush,
                          alt.Color('Acousticness:Q', scale=alt.Scale(scheme='turbo',reverse=True)),
                          alt.value("lightgrey")),
).add_selection(
    brush,
).properties(
    width = 720,
    height = 450,
    title="Spotify dataset from Kaggle"
)

chart